{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h1> Structured data prediction using Cloud ML Engine with scikit-learn </h1>\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Creating datasets for Machine Learning using BigQuery\n",
    "<li> Creating a model using scitkit learn \n",
    "<li> Training on Cloud ML Engine\n",
    "<li> Deploying model\n",
    "<li> Predicting with model\n",
    "<li> Hyperparameter tuning of scikit-learn models\n",
    "</ol>\n",
    "\n",
    "Please see [this notebook](../babyweight/babyweight.ipynb) for more context on this problem and how the features were chosen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml'\n",
    "PROJECT = 'cloud-training-demos'\n",
    "PROJECTNUMBER = '663413318684'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['PROJECTNUMBER'] = PROJECTNUMBER\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\n",
      "Updated property [compute/region].\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "gcloud config set project $PROJECT\n",
    "gcloud config set compute/region $REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "if ! gcloud storage ls | grep -q gs://${BUCKET}/; then\n",    "  gcloud storage buckets create --location ${REGION} gs://${BUCKET}\n",    "fi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "# Pandas will use this privatekey to access BigQuery on our behalf.\n",
    "# Do NOT check in the private key into git!!!\n",
    "# if you get a JWT grant error when using this key, create the key via gcp web console in IAM > Service Accounts section\n",
    "KEYFILE=babyweight/trainer/privatekey.json\n",
    "if [ ! -f $KEYFILE ]; then\n",
    "  gcloud iam service-accounts keys create \\\n",
    "      --iam-account ${PROJECTNUMBER}-compute@developer.gserviceaccount.com \\\n",
    "      $KEYFILE\n",
    "fi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "KEYDIR='babyweight/trainer'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Exploring dataset\n",
    "\n",
    "Please see [this notebook](../babyweight/babyweight.ipynb) for more context on this problem and how the features were chosen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "#%writefile babyweight/trainer/model.py\n",
    "\n",
    "# Copyright 2018 Google Inc. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Creating a ML dataset using BigQuery </h2>\n",
    "\n",
    "We can use BigQuery to create the training and evaluation datasets. Because of the masking (ultrasound vs. no ultrasound), the query itself is a little complex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "#%writefile -a babyweight/trainer/model.py\n",
    "def create_queries():\n",
    "  query_all = \"\"\"\n",
    "  WITH with_ultrasound AS (\n",
    "    SELECT\n",
    "      weight_pounds AS label,\n",
    "      CAST(is_male AS STRING) AS is_male,\n",
    "      mother_age,\n",
    "      CAST(plurality AS STRING) AS plurality,\n",
    "      gestation_weeks,\n",
    "      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "    FROM\n",
    "      publicdata.samples.natality\n",
    "    WHERE\n",
    "      year > 2000\n",
    "      AND gestation_weeks > 0\n",
    "      AND mother_age > 0\n",
    "      AND plurality > 0\n",
    "      AND weight_pounds > 0\n",
    "  ),\n",
    "\n",
    "  without_ultrasound AS (\n",
    "    SELECT\n",
    "      weight_pounds AS label,\n",
    "      'Unknown' AS is_male,\n",
    "      mother_age,\n",
    "      IF(plurality > 1, 'Multiple', 'Single') AS plurality,\n",
    "      gestation_weeks,\n",
    "      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "    FROM\n",
    "      publicdata.samples.natality\n",
    "    WHERE\n",
    "      year > 2000\n",
    "      AND gestation_weeks > 0\n",
    "      AND mother_age > 0\n",
    "      AND plurality > 0\n",
    "      AND weight_pounds > 0\n",
    "  ),\n",
    "\n",
    "  preprocessed AS (\n",
    "    SELECT * from with_ultrasound\n",
    "    UNION ALL\n",
    "    SELECT * from without_ultrasound\n",
    "  )\n",
    "\n",
    "  SELECT\n",
    "      label,\n",
    "      is_male,\n",
    "      mother_age,\n",
    "      plurality,\n",
    "      gestation_weeks\n",
    "  FROM\n",
    "      preprocessed\n",
    "  \"\"\"\n",
    "\n",
    "  train_query = \"{} WHERE ABS(MOD(hashmonth, 4)) < 3\".format(query_all)\n",
    "  eval_query  = \"{} WHERE ABS(MOD(hashmonth, 4)) = 3\".format(query_all)\n",
    "  return train_query, eval_query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "  WITH with_ultrasound AS (\n",
      "    SELECT\n",
      "      weight_pounds AS label,\n",
      "      CAST(is_male AS STRING) AS is_male,\n",
      "      mother_age,\n",
      "      CAST(plurality AS STRING) AS plurality,\n",
      "      gestation_weeks,\n",
      "      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
      "    FROM\n",
      "      publicdata.samples.natality\n",
      "    WHERE\n",
      "      year > 2000\n",
      "      AND gestation_weeks > 0\n",
      "      AND mother_age > 0\n",
      "      AND plurality > 0\n",
      "      AND weight_pounds > 0\n",
      "  ),\n",
      "\n",
      "  without_ultrasound AS (\n",
      "    SELECT\n",
      "      weight_pounds AS label,\n",
      "      'Unknown' AS is_male,\n",
      "      mother_age,\n",
      "      IF(plurality > 1, 'Multiple', 'Single') AS plurality,\n",
      "      gestation_weeks,\n",
      "      FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
      "    FROM\n",
      "      publicdata.samples.natality\n",
      "    WHERE\n",
      "      year > 2000\n",
      "      AND gestation_weeks > 0\n",
      "      AND mother_age > 0\n",
      "      AND plurality > 0\n",
      "      AND weight_pounds > 0\n",
      "  ),\n",
      "\n",
      "  preprocessed AS (\n",
      "    SELECT * from with_ultrasound\n",
      "    UNION ALL\n",
      "    SELECT * from without_ultrasound\n",
      "  )\n",
      "\n",
      "  SELECT\n",
      "      label,\n",
      "      is_male,\n",
      "      mother_age,\n",
      "      plurality,\n",
      "      gestation_weeks\n",
      "  FROM\n",
      "      preprocessed\n",
      "   WHERE ABS(MOD(hashmonth, 4)) < 3\n"
     ]
    }
   ],
   "source": [
    "print create_queries()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "#%writefile -a babyweight/trainer/model.py\n",
    "def query_to_dataframe(query):\n",
    "  import pandas as pd\n",
    "  import pkgutil\n",
    "  privatekey = pkgutil.get_data(KEYDIR, 'privatekey.json')\n",
    "  print(privatekey[:200])\n",
    "  return pd.read_gbq(query,\n",
    "                     project_id=PROJECT,\n",
    "                     dialect='standard',\n",
    "                     private_key=privatekey)\n",
    "\n",
    "def create_dataframes(frac):  \n",
    "  # small dataset for testing\n",
    "  if frac > 0 and frac < 1:\n",
    "    sample = \" AND RAND() < {}\".format(frac)\n",
    "  else:\n",
    "    sample = \"\"\n",
    "\n",
    "  train_query, eval_query = create_queries()\n",
    "  train_query = \"{} {}\".format(train_query, sample)\n",
    "  eval_query =  \"{} {}\".format(eval_query, sample)\n",
    "\n",
    "  train_df = query_to_dataframe(train_query)\n",
    "  eval_df = query_to_dataframe(eval_query)\n",
    "  return train_df, eval_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"type\": \"service_account\",\n",
      "  \"project_id\": \"cloud-training-demos\",\n",
      "  \"private_key_id\": \"ef88065bb770531b91fb45e31ae60539475547d8\",\n",
      "  \"private_key\": \"-----BEGIN PRIVATE KEY-----\\nMIIEvgIBADANBgkqhk\n",
      "{\n",
      "  \"type\": \"service_account\",\n",
      "  \"project_id\": \"cloud-training-demos\",\n",
      "  \"private_key_id\": \"ef88065bb770531b91fb45e31ae60539475547d8\",\n",
      "  \"private_key\": \"-----BEGIN PRIVATE KEY-----\\nMIIEvgIBADANBgkqhk\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>mother_age</th>\n",
       "      <th>gestation_weeks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>52962.000000</td>\n",
       "      <td>52962.000000</td>\n",
       "      <td>52962.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>7.221937</td>\n",
       "      <td>27.415996</td>\n",
       "      <td>38.581757</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.325003</td>\n",
       "      <td>6.148160</td>\n",
       "      <td>2.580018</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.500449</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>17.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>6.563162</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>38.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>7.312733</td>\n",
       "      <td>27.000000</td>\n",
       "      <td>39.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>8.062305</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>13.232145</td>\n",
       "      <td>53.000000</td>\n",
       "      <td>47.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              label    mother_age  gestation_weeks\n",
       "count  52962.000000  52962.000000     52962.000000\n",
       "mean       7.221937     27.415996        38.581757\n",
       "std        1.325003      6.148160         2.580018\n",
       "min        0.500449     12.000000        17.000000\n",
       "25%        6.563162     23.000000        38.000000\n",
       "50%        7.312733     27.000000        39.000000\n",
       "75%        8.062305     32.000000        40.000000\n",
       "max       13.232145     53.000000        47.000000"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df, eval_df = create_dataframes(0.001)\n",
    "train_df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>is_male</th>\n",
       "      <th>mother_age</th>\n",
       "      <th>plurality</th>\n",
       "      <th>gestation_weeks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.124358</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>24</td>\n",
       "      <td>Single</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.562179</td>\n",
       "      <td>false</td>\n",
       "      <td>34</td>\n",
       "      <td>1</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.563077</td>\n",
       "      <td>true</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.522496</td>\n",
       "      <td>false</td>\n",
       "      <td>18</td>\n",
       "      <td>2</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.623908</td>\n",
       "      <td>false</td>\n",
       "      <td>21</td>\n",
       "      <td>1</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label  is_male  mother_age plurality  gestation_weeks\n",
       "0  1.124358  Unknown          24    Single               17\n",
       "1  0.562179    false          34         1               19\n",
       "2  1.563077     true          18         1               20\n",
       "3  0.522496    false          18         2               20\n",
       "4  0.623908    false          21         1               20"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eval_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Creating a scikit-learn model using random forests </h2>\n",
    "\n",
    "Let's train the model locally"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "#%writefile -a babyweight/trainer/model.py\n",
    "def input_fn(indf):\n",
    "  import copy\n",
    "  import pandas as pd\n",
    "  df = copy.deepcopy(indf)\n",
    "\n",
    "  # one-hot encode the categorical columns\n",
    "  df[\"plurality\"] = df[\"plurality\"].astype(pd.api.types.CategoricalDtype(\n",
    "                    categories=[\"Single\",\"Multiple\",\"1\",\"2\",\"3\",\"4\",\"5\"]))\n",
    "  df[\"is_male\"] = df[\"is_male\"].astype(pd.api.types.CategoricalDtype(\n",
    "                  categories=[\"Unknown\",\"false\",\"true\"]))\n",
    "  # features, label\n",
    "  label = df['label']\n",
    "  del df['label']\n",
    "  features = pd.get_dummies(df)\n",
    "  return features, label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   mother_age  gestation_weeks  is_male_Unknown  is_male_false  is_male_true  \\\n",
      "0          20               17                0              1             0   \n",
      "1          21               17                0              1             0   \n",
      "2          26               17                0              0             1   \n",
      "3          27               17                1              0             0   \n",
      "4          29               17                1              0             0   \n",
      "\n",
      "   plurality_Single  plurality_Multiple  plurality_1  plurality_2  \\\n",
      "0                 0                   0            1            0   \n",
      "1                 0                   0            1            0   \n",
      "2                 0                   0            0            0   \n",
      "3                 0                   1            0            0   \n",
      "4                 1                   0            0            0   \n",
      "\n",
      "   plurality_3  plurality_4  plurality_5  \n",
      "0            0            0            0  \n",
      "1            0            0            0  \n",
      "2            1            0            0  \n",
      "3            0            0            0  \n",
      "4            0            0            0  \n",
      "0    1.000899\n",
      "1    1.437414\n",
      "2    0.518086\n",
      "3    1.344820\n",
      "4    0.551156\n",
      "Name: label, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "train_x, train_y = input_fn(train_df)\n",
    "print(train_x[:5])\n",
    "print(train_y[:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,\n",
       "           max_features='auto', max_leaf_nodes=None,\n",
       "           min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "           min_samples_leaf=1, min_samples_split=2,\n",
       "           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n",
       "           oob_score=False, random_state=0, verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor\n",
    "estimator = RandomForestRegressor(max_depth=5, n_estimators=100, random_state=0)\n",
    "estimator.fit(train_x, train_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[6.10063904 6.10063904 5.19180783 5.20729443 6.10063904]\n",
      "1000    6.563162\n",
      "1001    6.415452\n",
      "1002    5.749656\n",
      "1003    5.500533\n",
      "1004    9.751046\n",
      "Name: label, dtype: float64\n",
      "1.0421166304081477\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "eval_x, eval_y = input_fn(eval_df)\n",
    "eval_pred = estimator.predict(eval_x)\n",
    "print(eval_pred[1000:1005])\n",
    "print(eval_y[1000:1005])\n",
    "print(np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y))))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "#%writefile -a babyweight/trainer/model.py\n",
    "def train_and_evaluate(frac, max_depth=5, n_estimators=100):\n",
    "  import numpy as np\n",
    "\n",
    "  # get data\n",
    "  train_df, eval_df = create_dataframes(frac)\n",
    "  train_x, train_y = input_fn(train_df)\n",
    "  # train\n",
    "  from sklearn.ensemble import RandomForestRegressor\n",
    "  estimator = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=0)\n",
    "  estimator.fit(train_x, train_y)\n",
    "  # evaluate\n",
    "  eval_x, eval_y = input_fn(eval_df)\n",
    "  eval_pred = estimator.predict(eval_x)\n",
    "  rmse = np.sqrt(np.mean((eval_pred-eval_y)*(eval_pred-eval_y)))\n",
    "  print(\"Eval rmse={}\".format(rmse))\n",
    "  return estimator, rmse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "#%writefile -a babyweight/trainer/model.py\n",
    "def save_model(estimator, gcspath, name):\n",
    "  from sklearn.externals import joblib\n",
    "  import os, subprocess, datetime\n",
    "  model = 'model.joblib'\n",
    "  joblib.dump(estimator, model)\n",
    "  model_path = os.path.join(gcspath, datetime.datetime.now().strftime(\n",
    "    'export_%Y%m%d_%H%M%S'), model)\n",
    "  subprocess.check_call(['gcloud', 'storage', 'cp', model, model_path])\n",    "  return model_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "saved = save_model(estimator, 'gs://{}/babyweight/sklearn'.format(BUCKET), 'babyweight')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://cloud-training-demos-ml/babyweight/sklearn/export_20180524_233356/babyweight.joblib\n"
     ]
    }
   ],
   "source": [
    "print saved"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Packaging up as a Python package\n",
    "\n",
    "Note the %writefile in the cells above. I uncommented those and ran the cells to write out a model.py\n",
    "The following cell writes out a task.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting babyweight/trainer/task.py\n"
     ]
    }
   ],
   "source": [
    "%writefile babyweight/trainer/task.py\n",
    "# Copyright 2018 Google Inc. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "import argparse\n",
    "import os\n",
    "\n",
    "import hypertune\n",
    "import model\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument(\n",
    "        '--bucket',\n",
    "        help = 'GCS path to output.',\n",
    "        required = True\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        '--frac',\n",
    "        help = 'Fraction of input to process',\n",
    "        type = float,\n",
    "        required = True\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        '--maxDepth',\n",
    "        help = 'Depth of trees',\n",
    "        type = int,\n",
    "        default = 5\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        '--numTrees',\n",
    "        help = 'Number of trees',\n",
    "        type = int,\n",
    "        default = 100\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        '--projectId',\n",
    "        help = 'ID (not name) of your project',\n",
    "        required = True\n",
    "    )\n",
    "    parser.add_argument(\n",
    "        '--job-dir',\n",
    "        help = 'output directory for model, automatically provided by gcloud',\n",
    "        required = True\n",
    "    )\n",
    "    \n",
    "    args = parser.parse_args()\n",
    "    arguments = args.__dict__\n",
    "    \n",
    "    model.PROJECT = arguments['projectId']\n",
    "    model.KEYDIR  = 'trainer'\n",
    "    \n",
    "    estimator, rmse = model.train_and_evaluate(arguments['frac'],\n",
    "                                         arguments['maxDepth'],\n",
    "                                         arguments['numTrees']\n",
    "                                        )\n",
    "    loc = model.save_model(estimator, \n",
    "                           arguments['job_dir'], 'babyweight')\n",
    "    print(\"Saved model to {}\".format(loc))\n",
    "    \n",
    "    # this is for hyperparameter tuning\n",
    "    hpt = hypertune.HyperTune()\n",
    "    hpt.report_hyperparameter_tuning_metric(\n",
    "        hyperparameter_metric_tag='rmse',\n",
    "        metric_value=rmse,\n",
    "        global_step=0)\n",
    "\n",
    "# done"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pandas==0.22.0\n",
      "pandas-gbq==0.3.0\n",
      "pandas-profiling==1.4.1\n",
      "\u001b[33mYou are using pip version 9.0.3, however version 10.0.1 is available.\n",
      "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip freeze | grep pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting babyweight/setup.py\n"
     ]
    }
   ],
   "source": [
    "%writefile babyweight/setup.py\n",
    "# Copyright 2018 Google Inc. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "from setuptools import setup\n",
    "\n",
    "setup(name='trainer',\n",
    "      version='1.0',\n",
    "      description='Natality, with sklearn',\n",
    "      url='http://github.com/GoogleCloudPlatform/training-data-analyst',\n",
    "      author='Google',\n",
    "      author_email='nobody@google.com',\n",
    "      license='Apache2',\n",
    "      packages=['trainer'],\n",
    "      ## WARNING! Do not upload this package to PyPI\n",
    "      ## BECAUSE it contains a private key\n",
    "      package_data={'': ['privatekey.json']},\n",
    "      install_requires=[\n",
    "          'pandas-gbq==0.3.0',\n",
    "          'urllib3',\n",
    "          'google-cloud-bigquery==0.29.0',\n",
    "          'cloudml-hypertune'\n",
    "      ],\n",
    "      zip_safe=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Try out the package on a subset of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight\n",
    "python -m trainer.task \\\n",
    "   --bucket=${BUCKET} --frac=0.001 --job-dir=gs://${BUCKET}/babyweight/sklearn --projectId $PROJECT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Training on Cloud ML Engine </h2>\n",
    "\n",
    "Submit the code to the ML Engine service"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "\n",
    "RUNTIME_VERSION=\"1.8\"\n",
    "PYTHON_VERSION=\"2.7\"\n",
    "JOB_NAME=babyweight_skl_$(date +\"%Y%m%d_%H%M%S\")\n",
    "JOB_DIR=\"gs://$BUCKET/babyweight/sklearn/${JOB_NAME}\"\n",
    "\n",
    "gcloud ml-engine jobs submit training $JOB_NAME \\\n",
    "  --job-dir $JOB_DIR \\\n",
    "  --package-path $(pwd)/babyweight/trainer \\\n",
    "  --module-name trainer.task \\\n",
    "  --region us-central1 \\\n",
    "  --runtime-version=$RUNTIME_VERSION \\\n",
    "  --python-version=$PYTHON_VERSION \\\n",
    "  -- \\\n",
    "  --bucket=${BUCKET} --frac=0.1 --projectId $PROJECT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The training finished in 20 minutes with an RMSE of 1.05 lbs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Deploying the trained model </h2>\n",
    "<p>\n",
    "Deploying the trained model to act as a REST web service is a simple gcloud call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://cloud-training-demos-ml/babyweight/sklearn/export_20180526_185457/\n"
     ]
    }
   ],
   "source": [
    "%bash\n",
    "gcloud storage ls gs://${BUCKET}/babyweight/sklearn/ | tail -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "MODEL_NAME=\"babyweight\"\n",
    "MODEL_VERSION=\"skl\"\n",
    "MODEL_LOCATION=$(gcloud storage ls gs://${BUCKET}/babyweight/sklearn/ | tail -1)\n",
    "echo \"Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION ... this will take a few minutes\"\n",
    "#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}\n",
    "#gcloud ml-engine models delete ${MODEL_NAME}\n",
    "#gcloud ml-engine models create ${MODEL_NAME} --regions $REGION\n",
    "gcloud alpha ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} \\\n",
    "    --framework SCIKIT_LEARN --runtime-version 1.8  --python-version=2.7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h2> Using the model to predict </h2>\n",
    "<p>\n",
    "Send a JSON request to the service endpoint to predict a baby's weight. The request must contain an array of numbers in the same column order that was used to train the model. sklearn's Pipeline could carry some of the preprocessing into the deployed model, but because we did our preprocessing with Pandas, that is not an option here.\n",
    "<p>\n",
    "So, let's find the order of columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index([u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',\n",
      "       u'is_male_true', u'plurality_Single', u'plurality_Multiple',\n",
      "       u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',\n",
      "       u'plurality_5'],\n",
      "      dtype='object')\n",
      "[[24, 17, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0], [34, 19, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]]\n"
     ]
    }
   ],
   "source": [
    "data = []\n",
    "for i in range(2):\n",
    "  data.append([])\n",
    "  for col in eval_x:\n",
    "    # convert from numpy integers to standard integers\n",
    "    data[i].append(int(np.uint64(eval_x[col][i]).item()))\n",
    "\n",
    "print(eval_x.columns)\n",
    "print(json.dumps(data))"
   ]
  },
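  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Rather than assembling each row by hand, a small helper can place values in the column order printed above. This is a minimal sketch, not part of the trainer package: `COLUMNS` is copied from the `eval_x.columns` output, `encode_instance` is a hypothetical name, and it assumes every one-hot column is named `<feature>_<value>`.\n",
    "\n",
    "```python\n",
    "# Column order copied from the eval_x.columns output above.\n",
    "COLUMNS = ['mother_age', 'gestation_weeks', 'is_male_Unknown', 'is_male_false',\n",
    "           'is_male_true', 'plurality_Single', 'plurality_Multiple',\n",
    "           'plurality_1', 'plurality_2', 'plurality_3', 'plurality_4',\n",
    "           'plurality_5']\n",
    "\n",
    "def encode_instance(mother_age, gestation_weeks, is_male, plurality):\n",
    "    # Hypothetical helper: build one row in the order the model was trained on.\n",
    "    # Assumes one-hot columns are named <feature>_<value>.\n",
    "    row = dict.fromkeys(COLUMNS, 0)\n",
    "    row['mother_age'] = int(mother_age)\n",
    "    row['gestation_weeks'] = int(gestation_weeks)\n",
    "    for name in ('is_male_' + str(is_male), 'plurality_' + str(plurality)):\n",
    "        if name not in row:\n",
    "            raise ValueError('unknown category column: ' + name)\n",
    "        row[name] = 1\n",
    "    return [row[col] for col in COLUMNS]\n",
    "\n",
    "instances = [encode_instance(24, 38, 'Unknown', 'Single'),\n",
    "             encode_instance(34, 39, 'true', 1)]\n",
    "```\n",
    "\n",
    "Each resulting row can be used as one entry of the `instances` list in a prediction request. Raising on an unknown category keeps a typo from silently producing an all-zero one-hot group."
   ]
  },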
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "As long as you send in the data in that order, it will work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "response={u'predictions': [7.075709854636022, 7.715287844651161]}\n"
     ]
    }
   ],
   "source": [
    "from googleapiclient import discovery\n",
    "from oauth2client.client import GoogleCredentials\n",
    "import json\n",
    "\n",
    "credentials = GoogleCredentials.get_application_default()\n",
    "api = discovery.build('ml', 'v1', credentials=credentials)\n",
    "\n",
    "request_data = {'instances':\n",
    "  # Column order, as printed by eval_x.columns above:\n",
    "  # [u'mother_age', u'gestation_weeks', u'is_male_Unknown', u'is_male_false',\n",
    "  #     u'is_male_true', u'plurality_Single', u'plurality_Multiple',\n",
    "  #     u'plurality_1', u'plurality_2', u'plurality_3', u'plurality_4',\n",
    "  #     u'plurality_5']\n",
    "  [[24, 38, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],\n",
    "   [34, 39, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]]\n",
    "}\n",
    "\n",
    "parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'babyweight', 'skl')\n",
    "response = api.projects().predict(body=request_data, name=parent).execute()\n",
    "print(\"response={0}\".format(response))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Hyperparameter tuning\n",
    "\n",
    "Let's run up to 100 trials, five at a time in parallel, to find good values for maxDepth and numTrees."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing hyperparam.yaml\n"
     ]
    }
   ],
   "source": [
    "%writefile hyperparam.yaml\n",
    "trainingInput:\n",
    "  hyperparameters:\n",
    "    goal: MINIMIZE\n",
    "    maxTrials: 100\n",
    "    maxParallelTrials: 5\n",
    "    hyperparameterMetricTag: rmse\n",
    "    params:\n",
    "    - parameterName: maxDepth\n",
    "      type: INTEGER\n",
    "      minValue: 2\n",
    "      maxValue: 8\n",
    "      scaleType: UNIT_LINEAR_SCALE\n",
    "    - parameterName: numTrees\n",
    "      type: INTEGER\n",
    "      minValue: 50\n",
    "      maxValue: 150\n",
    "      scaleType: UNIT_LINEAR_SCALE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "RUNTIME_VERSION=\"1.8\"\n",
    "PYTHON_VERSION=\"2.7\"\n",
    "JOB_NAME=babyweight_skl_$(date +\"%Y%m%d_%H%M%S\")\n",
    "JOB_DIR=\"gs://$BUCKET/babyweight/sklearn/${JOB_NAME}\"\n",
    "\n",
    "gcloud ml-engine jobs submit training $JOB_NAME \\\n",
    "  --job-dir $JOB_DIR \\\n",
    "  --package-path $(pwd)/babyweight/trainer \\\n",
    "  --module-name trainer.task \\\n",
    "  --region us-central1 \\\n",
    "  --runtime-version=$RUNTIME_VERSION \\\n",
    "  --python-version=$PYTHON_VERSION \\\n",
    "  --config=hyperparam.yaml \\\n",
    "  -- \\\n",
    "  --bucket=${BUCKET} --frac=0.01 --projectId $PROJECT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "If you go to the GCP console and click on the job, you will see the trial information start to populate, with the lowest-RMSE trial listed first. I got the best performance with these settings:\n",
    "<pre>\n",
    "      \"hyperparameters\": {\n",
    "        \"maxDepth\": \"8\",\n",
    "        \"numTrees\": \"90\"\n",
    "      },\n",
    "      \"finalMetric\": {\n",
    "        \"trainingStep\": \"1\",\n",
    "        \"objectiveValue\": 1.03123724461\n",
    "      }\n",
    "</pre>"
   ]
  },
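  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "The same ranking can be reproduced offline. The following is a minimal sketch, assuming the trial results have been loaded into a Python list of dicts shaped like the snippet above. `best_trial` is a hypothetical helper, and the second entry in `trials` is a made-up placeholder, not a real result.\n",
    "\n",
    "```python\n",
    "def best_trial(trials):\n",
    "    # Hypothetical helper: return the trial with the lowest objectiveValue,\n",
    "    # matching goal: MINIMIZE in hyperparam.yaml.\n",
    "    return min(trials, key=lambda t: t['finalMetric']['objectiveValue'])\n",
    "\n",
    "# The first entry is the result quoted above; the second is a made-up\n",
    "# placeholder so that there is something to compare against.\n",
    "trials = [\n",
    "    {'hyperparameters': {'maxDepth': '8', 'numTrees': '90'},\n",
    "     'finalMetric': {'trainingStep': '1', 'objectiveValue': 1.03123724461}},\n",
    "    {'hyperparameters': {'maxDepth': '2', 'numTrees': '50'},\n",
    "     'finalMetric': {'trainingStep': '1', 'objectiveValue': 9.99}},\n",
    "]\n",
    "\n",
    "best = best_trial(trials)\n",
    "# Hyperparameter values are reported as strings, so cast before reuse.\n",
    "max_depth = int(best['hyperparameters']['maxDepth'])\n",
    "num_trees = int(best['hyperparameters']['numTrees'])\n",
    "```\n",
    "\n",
    "The cast to `int` matters because the service reports hyperparameter values as strings, as the quoted output shows."
   ]
  },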
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Train on full dataset\n",
    "\n",
    "Let's train on the full dataset with these hyperparameters. I am using a larger machine [(8 CPUs, 52 GB of memory)](https://cloud.google.com/ml-engine/docs/tensorflow/machine-types)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing largemachine.yaml\n"
     ]
    }
   ],
   "source": [
    "%writefile largemachine.yaml\n",
    "trainingInput:\n",
    "  scaleTier: CUSTOM\n",
    "  masterType: large_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "%bash\n",
    "\n",
    "RUNTIME_VERSION=\"1.8\"\n",
    "PYTHON_VERSION=\"2.7\"\n",
    "JOB_NAME=babyweight_skl_$(date +\"%Y%m%d_%H%M%S\")\n",
    "JOB_DIR=\"gs://$BUCKET/babyweight/sklearn/${JOB_NAME}\"\n",
    "\n",
    "gcloud ml-engine jobs submit training $JOB_NAME \\\n",
    "  --job-dir $JOB_DIR \\\n",
    "  --package-path $(pwd)/babyweight/trainer \\\n",
    "  --module-name trainer.task \\\n",
    "  --region us-central1 \\\n",
    "  --runtime-version=$RUNTIME_VERSION \\\n",
    "  --python-version=$PYTHON_VERSION \\\n",
    "  --scale-tier=CUSTOM \\\n",
    "  --config=largemachine.yaml \\\n",
    "  -- \\\n",
    "  --bucket=${BUCKET} --frac=1 --projectId $PROJECT --maxDepth 8 --numTrees 90"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Copyright 2018 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
